fix(mcp): reap serve --mcp child when parent is SIGKILL'd (#277) #286
…ry#277)

On Linux the kernel doesn't propagate parent death to children, and the existing `stdin.on('end' | 'close')` handlers don't always fire when an MCP host (Claude Code, opencode, …) is force-killed by the OOM killer, a `kill -9`, or a container teardown. The reporter in colbymchenry#277 ended up with three orphan `codegraph serve --mcp` processes pinned across sessions, each holding its own inotify watch set (~440k watches), which then tripped colbymchenry#276's watch-budget exhaustion in unrelated tools (Next.js, IDEs).

Capture `process.ppid` once at server construction, poll it on a `setInterval`, and shut down the moment it diverges from that baseline. The interval is `.unref()`'d so it never holds the event loop open on its own; the poll period is `CODEGRAPH_PPID_POLL_MS` (default 5000ms, `0` disables for embedded hosts that re-parent on purpose).

The regression test stands up a four-tier process tree (vitest → wrapper → {stdin-holder, codegraph}) so the wrapper's SIGKILL doesn't transitively close codegraph's stdin (the sibling stdin-holder keeps the pipe's write-end alive). That isolates the watchdog from the pre-existing stdin-close path: the test fails without the watchdog and passes with it.
The watchdog added for colbymchenry#277 watched only `process.ppid`. On current main, the `--liftoff-only` re-exec inserts an intermediate process between the MCP host and the server. That intermediate outlives the host (it is blocked in `spawnSync`), so the server's own ppid never changes when the host dies and the watchdog never fires. As a result, the regression test fails on main.

Propagate the host PID across the re-exec (`CODEGRAPH_HOST_PPID`) and have the watchdog poll it for liveness, keeping the ppid-divergence check for the direct (bundled) launch path.

Validated: the watchdog test passes with the re-exec active, and an A/B run (watchdog on vs. off) reaps the orphan only with it on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
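A minimal sketch of the host-liveness side of this fix, assuming the server reads `CODEGRAPH_HOST_PPID` from its environment and probes the PID with signal `0`. The helper names `readHostPid` and `isProcessAlive` are hypothetical and not taken from the PR; only the environment variable name comes from the commit message.

```typescript
// Hypothetical helpers: names and error handling are illustrative assumptions.

// Parse the host PID that the re-exec is assumed to pass down.
function readHostPid(env: Record<string, string | undefined>): number | undefined {
  const raw = env.CODEGRAPH_HOST_PPID;
  if (raw === undefined) return undefined;
  const pid = Number.parseInt(raw, 10);
  // Reject NaN, zero, and negative values instead of probing them.
  return Number.isInteger(pid) && pid > 0 ? pid : undefined;
}

// Signal 0 delivers nothing; it only asks the OS whether the PID can be signalled.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to someone else. Treat it as alive.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

A poll loop would call `isProcessAlive(hostPid)` on each tick and shut down on the first `false`, falling back to the ppid-divergence check when `readHostPid` returns `undefined`.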
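Merged, thank you @evanclan! 🎉 Genuinely one of the best-documented PRs I've reviewed: the four-tier process-tree test (dup'ing a sibling's stdout into codegraph's stdin so SIGKILL'ing the wrapper doesn't close stdin) isolates the watchdog perfectly.

One thing surfaced during validation that needed a fix on your branch before merge: main moved under the PR. Since you opened it, main gained a `--liftoff-only` re-exec that inserts an intermediate process between the MCP host and the server, so the server's own ppid never changes when the host dies.

Fix (commit on your branch, credited to you via squash): propagate the host PID across the re-exec via `CODEGRAPH_HOST_PPID` and have the watchdog poll it for liveness.

Validation on macOS (reparent-to-launchd mirrors the Linux case): the watchdog test passes with the re-exec active, and an A/B run (watchdog on vs. off) reaps the orphan only with it on.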
Thanks again, great contribution. 🙏
Summary
Adds a `process.ppid` watchdog to `MCPServer.start()` so a `codegraph serve --mcp` child terminates when its MCP host is force-killed. Resolves #277.
Problem
The existing shutdown path (`src/mcp/index.ts`) leans entirely on signal handlers and stdin close events:
```75:79:src/mcp/index.ts
process.on('SIGINT', () => this.stop());
process.on('SIGTERM', () => this.stop());
```
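On Linux that's not enough when the host (Claude Code, opencode, …) is SIGKILL'd by the OOM killer, a `kill -9`, or a container teardown: the kernel doesn't propagate parent death to children, and the stdin `end` / `close` handlers don't always fire, so the server can be left running as an orphan.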
Solution
Capture `process.ppid` once at construction, then `setInterval` (default 5s, `.unref()`'d) to check it. The moment it diverges from the baseline, we know the original parent has died and we tear down cleanly:
```text
[CodeGraph MCP] Parent process exited (ppid 9177 -> 1); shutting down.
```
Cross-platform: reparenting changes `process.ppid` on Linux and macOS; on Windows the value drops to 0 once the parent is gone, which also trips the check.
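A minimal sketch of the baseline-and-poll idea described above. The function names and the injectable `getPpid` parameter are hypothetical, chosen here so the check can be exercised without killing a real parent; the actual change lives in `MCPServer.start()` and may be shaped differently.

```typescript
// Hypothetical sketch: names and signature are assumptions, not the PR's code.

// The parent is considered gone once the current ppid differs from the baseline.
function parentChanged(baseline: number, current: number): boolean {
  return current !== baseline;
}

function startPpidWatchdog(
  onParentExit: (baseline: number, current: number) => void,
  pollMs: number = 5000,
  getPpid: () => number = () => process.ppid,
): ReturnType<typeof setInterval> | undefined {
  // A poll period of 0 disables the watchdog entirely.
  if (pollMs <= 0) return undefined;

  // Capture the baseline once, before any reparenting can happen.
  const baseline = getPpid();

  const timer = setInterval(() => {
    const current = getPpid();
    if (parentChanged(baseline, current)) {
      clearInterval(timer); // fire at most once
      onParentExit(baseline, current);
    }
  }, pollMs);

  // Don't let the poll timer alone keep the event loop (and the process) alive.
  timer.unref();
  return timer;
}
```

A caller would pass a callback that logs the `ppid <old> -> <new>` message shown above and then invokes the server's `stop()`.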
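Knobs: `CODEGRAPH_PPID_POLL_MS` sets the poll period (default 5000ms; `0` disables the watchdog for embedded hosts that re-parent on purpose).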
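A minimal sketch of how the `CODEGRAPH_PPID_POLL_MS` value might be resolved, using the default of 5000ms and the "`0` disables" rule stated in the commit message. The helper name `resolvePollMs` is hypothetical, and falling back to the default on unparseable or negative input is an assumption, not documented behaviour.

```typescript
// Hypothetical helper: the fallback policy for bad input is an assumption.
function resolvePollMs(raw: string | undefined, fallbackMs: number = 5000): number {
  // Unset or blank: use the default poll period.
  if (raw === undefined || raw.trim() === "") return fallbackMs;

  const parsed = Number(raw);
  // Assumed policy: garbage or negative values fall back to the default.
  if (!Number.isFinite(parsed) || parsed < 0) return fallbackMs;

  // 0 is preserved so the caller can treat it as "watchdog disabled".
  return Math.floor(parsed);
}
```

The caller would pass `process.env.CODEGRAPH_PPID_POLL_MS` and skip starting the interval when the result is `0`.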
`stop()` is now guarded by an idempotency flag so the watchdog can't race the existing stdin-close handlers and double-close the SQLite handle / transport.
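A minimal sketch of an idempotency guard of this kind, assuming a single boolean flag checked at the top of `stop()`. The `Stoppable` class and its `teardown` callback are hypothetical stand-ins for the server and its SQLite / transport cleanup.

```typescript
// Hypothetical stand-in for the server: only the guard pattern matters here.
class Stoppable {
  private stopped = false;
  private readonly teardown: () => void;

  constructor(teardown: () => void) {
    this.teardown = teardown;
  }

  // Returns true only for the call that actually performed the teardown.
  stop(): boolean {
    if (this.stopped) return false;
    // Set the flag before tearing down so a re-entrant call is a no-op too.
    this.stopped = true;
    this.teardown();
    return true;
  }
}
```

With this shape, the watchdog, the signal handlers, and the stdin-close handlers can all call `stop()` freely; only the first caller closes the resources.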
Out of scope
Test plan
Related
Made with Cursor